Author: Yap Jheng Khin

FYP II Title: Used car dealership web application

Purpose:

  1. This notebook explains:
    • The setup and validation of SHAP tree explainers.
    • The description and interpretation of beeswarm plot, feature importance bar plot, model loss bar plot, and model monitoring plot that are visualized at the model level (global explanation).
    • The description and interpretation of SHAP bar plot and SHAP loss bar plot that are visualized at the sample level (local explanation).
  2. Input:
    • Lead scoring dataset. The dataset information and preprocessing steps are explained in the FYP2_Lead_Scoring_Model_Training.ipynb.
    • Dictionary containing the extracted tree weights for lead scoring adaptive random forest model.
    • The lead scoring model predictions in npy format. Note that the model prediction is not performed here since it requires River library, which is not compatible with SHAP library.
  3. Ouput
    • HTML codes and files that contain visualizations that are exported by Plotly.

Execution time: At most 3 minutes in Jupyter Notebook.

Note: Execute all the cells again every time the notebook is re-opened to re-display the Plotly visualization.

Setup

Ensure that the current Python interpreter path is correct. For example, if the SHAP conda environment is named as arf_conda_exp_env, the expected sys.executable should be C:\Users\User\miniconda3\envs\arf_conda_exp_env\python.exe.

In [1]:
import sys
print(sys.executable)
C:\Users\User\miniconda3\envs\arf_conda_exp_env\python.exe
In [2]:
# Standard libraries
import os
import pandas as pd
import pickle
import numpy as np
from scipy import stats as scipy_stats
import shap
import json
import time

import plotly.express as px
from shap.plots import colors as shap_colors

# User-defined libraries
from data_preprocessing import DataPreprocessor
from explainer_visualization import prettify_feature_labels, \
shap_api_beeswarm_plot, get_shap_color_scale, get_beeswarm_plot_data, \
plot_dataman_bar_plot, get_feature_importance_plot_data, get_shap_loss_plot_data, \
plot_shap_monitoring, plot_shap_monitoring_plot_plotly, plot_local_bar_plot_plotly, \
get_shap_monitoring_plot_data
from general_utils import deserialize_arf

User Configuration

Set the AUTO_OPEN to True if the reader wants the HTML visualizations to automatically open in the new tab for preview.

Set the PRINT_HTML to True if the read wants to print the HTML contents in the output. Note that some HTML contents is quite long.

In [3]:
AUTO_OPEN  = False
PRINT_HTML = False
In [4]:
RANDOM_SEED = 2022
OUT_PARENT_DIR = 'outputs'
OUT_CHILD_DIR = 'lead_scoring_explainer'
OUT_FOLDER_PATH = os.path.join(OUT_PARENT_DIR, OUT_CHILD_DIR)

# Create directory if not exist
os.makedirs(OUT_FOLDER_PATH, exist_ok=True)
In [5]:
shap.initjs()

Load Models

The dictionary containing the extracted weights for the lead scoring model is loaded instead of the lead scoring model itself. The details of the dictionary can be found in FYP2_ARF_to_Dict_Conversion.ipynb.

In [6]:
# Load lead scoring ARF dictionary and its data preprocessor
with open('outputs/lead_scoring/arf_cf.json', 'r') as f:
    ls_arf_dict_serializable = json.load(f)

ls_arf_dict = deserialize_arf(ls_arf_dict_serializable)

# Load data preprocessor
with open('outputs/lead_scoring/data_preprocessor.pkl', 'rb') as f:
    ls_data_pp = pickle.load(f)

Load Data

In this notebook, the lead information used to seed the database is loaded for demonstration purpose. The lead information is filterd based on the following application logic:

The Status's initial value is Active when the users add the record. The users will set the Status to Qualified if the lead conversion is successful and set to Disqualified if the opposite case is true.

This means that:

  1. SHAP tree explainer will only consider the records which have the Status value of Qualified or Disqualified.
  2. The PSI and Chi-square statistics and will only consider the records which have the Status value of Active.

In other words, the truth output is available if and only if the Status is either Qualified or Disqualified. Note that the application logic is discussed in details in the FYP2 report.

The output below shows that half of the lead records contains the truth and vice versa. The lead records with the truth will be used for testing purposes. The result should be the same in the web application since these lead records are the exact data that seeded into the web application.

In [7]:
# Load lead records
lead_records = pd.read_csv(f'outputs/seed_data/lead_records.csv')

# Remove columns that are not needed to perform inference
ls_test = lead_records.copy().drop(columns=['pred_score', 'lead_status'], axis=1)

ls_target_attr = 'converted'
ls_X_test  = ls_test.drop(columns=ls_target_attr, axis=1)
ls_y_test  = ls_test[ls_target_attr]

# Preprocess the data
ls_X_test  = ls_data_pp.preprocess(ls_X_test)
In [8]:
truth_available = lead_records['lead_status'] != 'Active'

ls_X_test_truth_av = ls_X_test.loc[truth_available, :].copy()
ls_y_test_truth_av = ls_y_test[truth_available].copy()
ls_X_test_no_truth = ls_X_test.loc[~truth_available, :].copy()

print(f'The no. of lead records that have the truth: {ls_X_test_truth_av.shape[0]}')
print(f'The no. of lead records that don\'t have the truth: {ls_X_test_no_truth.shape[0]}')
The no. of lead records that have the truth: 652
The no. of lead records that don't have the truth: 653
In [9]:
ls_train = pd.read_csv(f'outputs/lead_scoring/train_set.csv')
# Shuffle the lead scoring records
rng = np.random.default_rng(RANDOM_SEED)
ls_train = ls_train.reindex(rng.permutation(ls_train.index))
ls_train = ls_train.reset_index(drop=True)

ls_target_attr = 'converted'
ls_X_train = ls_train.drop(columns=ls_target_attr, axis=1)
ls_y_train = ls_train[ls_target_attr]

# Preprocess the data
ls_X_train  = ls_data_pp.preprocess(ls_X_train)

Tree Explainer Initialization

There are two approaches to build a tree SHAP explainer which are tree_path_dependent approach and interventional approach. interventional approach is much slower than tree_path_dependent approach. As a result, tree_path_dependent approach is always used except for explaining model loss. However, based on the analysis from FYP2_Explainer_Validation.ipynb, only the tree SHAP explainer with the interventional approach is used.

To initialize a tree SHAP loss explainer to monitor model loss, three conditions must be met as follows:

  1. interventional approach must be used by setting feature_perturbation to interventional.
  2. The parameter named model_output must be set to log_loss.
  3. Background dataset must be provided when building the explainer. Scott Lundberg, who is the author of the SHAP library, does not recommend to provide more than 1000 random background samples since the running time scales linearly with the number of samples. The number of background samples can be as little as 100.
  4. The truth outputs must be provided when calculating the SHAP values.

Randomly choose 100 samples from the test set to provide as a background dataset for the explainer.

In [10]:
num_subsamples = 100

ls_X_test_truth_av_subsample = pd.read_csv('outputs/lead_scoring/X_test_truth_av_subsample.csv')

print(f'The size of the subsample: {ls_X_test_truth_av_subsample.shape}')
The size of the subsample: (100, 9)

Build a tree SHAP explainer that uses interventional approach.

In [11]:
ls_tree_explainer = shap.TreeExplainer(model = ls_arf_dict, 
                                       feature_perturbation = 'interventional', 
                                       data = ls_X_test_truth_av_subsample)

ls_shap_values = ls_tree_explainer.shap_values(ls_X_test_truth_av)

Build a tree SHAP loss explainer that uses interventional approach and the model_output is set to log_loss.

In [12]:
ls_tree_loss_explainer = shap.TreeExplainer(model = ls_arf_dict, 
                                            feature_perturbation = 'interventional', 
                                            model_output = 'log_loss',
                                            data = ls_X_test_truth_av_subsample)

ls_shap_loss_values = ls_tree_loss_explainer.shap_values(ls_X_test_truth_av, ls_y_test_truth_av)

Validation

The expected value is calculated differently depending on the following approach:

  1. interventional approach: If the model_output is not set to log_loss, The expected value is exactly the average of all the model predictions on the provided background dataset, as documented in the source code. Else if the model_output is set to log_loss, the expected value is a function that returns the value based on the given label value, as documented in the source code.
  2. tree_path_dependent approach: Since the background dataset is not available, the expected value is calculated by averaging the prediction values at the root node across all base learners, as documented in the source code.

Note that the background data should not be given if the tree_path_dependent approach is used since the background dataset will not always cover all the leaves in the model even the full training dataset is given.

Execute the code below and it will throw an assertion error and inform you to either switch to larger background dataset or use interventional approach.

ls_tree_exp_data = shap.TreeExplainer(model = ls_arf_dict, data = ls_X_train, 
                                      feature_perturbation='tree_path_dependent')
ls_shap_values_2 = ls_tree_exp_data.shap_values(ls_X_test_truth_av)

The three tests below validate the SHAP explainer that uses interventional approach. Note that the model's prediction output is imported since River library is not available.

The three tests below are only valid if and only if the explainer uses interventional approach and the model_output is set to raw. There is no tests for the SHAP loss explainer that uses interventional approach and the model_output is set to log_loss. As of April 2022, the official documentation only provide an example of validation test on XGB gradient boosting classifier, but not classifiers like random forest classifier. The validation test cannot be used in both Scikit-learn random forest classifier and River adaptive random forest classifier since the SHAP loss value calculation is different for random forest models and gradient boosting models. Hence, the proposed three tests also serve as a proxy test for the explainer that uses interventional approach with the model_output set to log_loss. Furthermore, the additivity check is only conducted when the explainer uses tree_path_dependent approach or when the explainer uses interventional approach and the model_output is set to raw. In other words, there is no standardized test on validating the ingestion of tree classifiers into the explainers that use the interventional approach.

Test 1: Ensure that the expected value is equivalent to the average of all the model predictions on the provided background dataset.

The background dataset can be set to more than 100 samples. However, the test is not accurate since the average of model predictions is not computed on the whole dataset. If the provided sample count exceeds 100, the Tree SHAP initialization function will resample the background dataset down to 100 samples as documented in the source code. It is because the object that mask the dataset, shap.maskers._tabular.Independent class instance, is initialized with 100 max samples and cannot be adjusted by the API users (as of SHAP 0.40.0). Hence, only a maximum of 100 samples out of the background dataset is used in calculating the expected value.

For sample count that exceeds 100, ensure that the expected value is equivalent to the average of all the model predictions on the resampled background dataset. To get the masked/resampled dataset, execute the code below.

ls_tree_explainer.masker.data

In [13]:
with open('outputs/lead_scoring/y_test_truth_av_subsample-prediction.npy', 'rb') as f:
    ls_y_test_truth_av_subsample_pred = np.load(f)

# Compute the average of River model prediction on the background dataset
true_y_pred_mean = ls_y_test_truth_av_subsample_pred.mean(0)

biggest_diff = np.abs(true_y_pred_mean.sum(0) - ls_tree_explainer.expected_value.sum(0)).max()

print(f'Is expected value == the average of all the model predictions on the background dataset'+\
      ' across all base learners?: '+\
      f'{biggest_diff < 1e-4}')
print(f'\n\tThe expected value in the explainer is {ls_tree_explainer.expected_value}.')
Is expected value == the average of all the model predictions on the background dataset across all base learners?: True

	The expected value in the explainer is [0.54051242 0.45948758].

Test 2: Ensure that the ingested SHAP model (a TreeEnsemble object) makes the same prediction probabilities as the original model.

In [14]:
# Original model's prediction output
with open('outputs/lead_scoring/y_test_truth_av-prediction.npy', 'rb') as f:
    ls_y_test_truth_av_pred = np.load(f)

# Original model's prediction output
true_y_pred = ls_y_test_truth_av_pred

# Ingested model's prediction output
y_pred_fr_ingested_model = ls_tree_explainer.model.predict(ls_X_test_truth_av)

largest_diff = np.abs(true_y_pred - y_pred_fr_ingested_model).flatten().max()

print(f'Does the ingested SHAP model make the same predictions as the original model?: '+\
      f'{largest_diff < 1e-4}')
print(f'\n\tThe largest difference is {largest_diff}.')
Does the ingested SHAP model make the same predictions as the original model?: True

	The largest difference is 4.440892098500626e-16.

Test 3: For each sample, ensure that the sum of the SHAP values equals to the difference between model output and the expected value.

In [15]:
# Original model's prediction output for positive class
# The shape is (number_of_samples, number_of_classes)
true_y_pred = ls_y_test_truth_av_pred

# Sum of the SHAP values and the expected value
# The expected value's shape is (number_of_classes, 1)
# The sum of SHAP values' shape is (number_of_classes, number_of_samples)
y_pred_calulated_fr_shap_val = \
ls_tree_explainer.expected_value[:, np.newaxis] + \
np.sum(np.array(ls_shap_values), axis=-1) 

largest_diff = np.abs(true_y_pred - y_pred_calulated_fr_shap_val.T).flatten().max()

print(f'model prediction output == expected value + sum of SHAP value of all features?: '+\
      f'{largest_diff < 1e-4}')
print(f'\n\tThe largest difference is {largest_diff}.')
model prediction output == expected value + sum of SHAP value of all features?: True

	The largest difference is 2.3336163335052618e-08.

Utilities

The author used Plotly library instead of using the SHAP library to construct plots like beeswarm plot and waterfall plot.

It is because Plotly provides functionalities as shown below:

  1. Building on top of Plotly JavaScript library (plotly.js), the plot can be constructed and configured in Python (plotly.py) before converting to JavaScript code. The plot can then be exported to a div element and display within a HTML document.
  2. As compared to Chart.js, Plotly can enhance user experience with minimum development effort. It is because Plotly provides a rich set of functionalities such as image downloading, zooming, and data hovering with little or no configuration.

All the plots are inspired from the SHAP library maintained by Scott Lundberg. The author has to re-implement the visualization in Plotly since Scott Lundberg uses Matplotlib. For some plots like beeswarm plot, the author deliberately extracts Lundberg's code into a separate function to understand code logic and validate the re-implementation.

The code below shows the function(s) needed for all the visualizations below.

The rest of the utilities functions can be found in explainer_visualization.py.

In [16]:
feature_label_names = {
    'total_site_visit': 'Total Site Visit', 
    'total_time_spend_on_site': 'Total Time Spend On Site',
    'avg_page_view_per_visit': 'Average Page View Per Visit', 
    'dont_email_Yes': 'Don\'t Email_Yes', 
    'occupation_Businessman': 'Occupation_Businessman',
    'occupation_Student': 'Occupation_Student', 
    'occupation_Unemployed': 'Occupation_Unemployed',
    'occupation_Working Professional': 'Occupation_Working Professional', 
    'received_free_copy_Yes': 'Received Free Copy_Yes'
}

Global Explanation

Beeswarm plot

Using explainer with tree_path_dependent approach, the SHAP values of each sample are calculated and individually plotted as a single dot to form a dense data that is grouped by feature. To prevent data points plotting on top of one another, random jittering effect is applied to moved the data points away from one another to prevent overlapping. As a result, the dense data points look like a beeswarm instead of a straight line. Beeswarm plot can be used to visualize the feature importance and correlation for each individual sample.

The x-axis shows the SHAP values and the y-axis shows the X features. The feature with the highest summed absolute SHAP values is displayed at the top, followed by the next highest summed absolute SHAP values in a top-down fashion. A scatterplot is drawn for each feature, where the feature value of each sample is plotted as a point. The feature value is represented by the colour ranging from blue to red. The more reddish the point's colour, the higher the feature value for that sample. The opposite case is true for the bluish colour. The continous colour scale allows the viewer to infer whether the feature is positively or negatively correlated with the prediction output.

Below is some examples on how to interpret the beeswarm plot:

  1. Total Time Spend On Site is the most important feature in determining the price.
  2. Features like Total Time Spend On Site and Occupation_Working Professional are positively correlated with lead score. It is because most red points are concentrated on the right side and most blue points are concentrated on the left side.
  3. Feature like Occupation_Unemployed, Don't Email_Yes, and Occupation_Student are negatively correlated with lead score. It is because most red points are concentrated on the left side and most blue points are concentrated on the right side.
  4. Higher Total Time Spend On Site correlates with higher lead score. The model predicts high lead score when the value of the Total Time Spend On Site is high.
  5. The Occupation_Unemployed feature negatively correlates with the lead score. The model predicts lower lead score when the value of the category value of Occupation is Unemployed.

Note that the categorical feature is formatted such that the categorical feature and its categorical value separated with an underscore.

Below is the Plotly implementation of beeswarm plot.

In [17]:
axis_color="#333333"

shap_color_scale = get_shap_color_scale(shap_colors.red_blue)

beeswarm_plot_data, y_tick_labels = get_beeswarm_plot_data(ls_X_test_truth_av, ls_shap_values[1])

y_tick_labels = prettify_feature_labels(y_tick_labels, feature_label_names)

fig = px.scatter(beeswarm_plot_data, color_continuous_scale = shap_color_scale,
                 x='shap_values', y='y_locations', color='color_values', 
                 labels=dict(shap_values = 'SHAP value (impact on model output)', y_locations="")
                )

fig.update_layout(
    width = 750,
    height = 1000,
    showlegend = False,
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
    xaxis = dict(
        ticks = 'outside',
        # Set the tick color and the tick label color to lighter dark colour
        tickcolor = axis_color,
        tickfont_color = axis_color,
        showline = True,
        linewidth = 1, 
        linecolor = axis_color
    ),
    yaxis = dict(
        tickmode = 'array',
        tickvals = list(range(len(y_tick_labels))),
        # Set the tick color and the tick label color to lighter dark colour
        ticktext = y_tick_labels,
        tickfont_color = axis_color,
        range = [-1, len(y_tick_labels)]
    ),
    coloraxis = dict(
        colorbar_title_text = 'Feature value',
        colorbar_title_side = 'right',
        colorbar_tickmode = 'array',
        colorbar_tickvals = [0.0, 1.0],
        colorbar_ticktext = ['Low', 'High'],
        colorbar_thickness = 15
    )
)

fig.update_traces(
    customdata = beeswarm_plot_data["feature_values"],
    hovertemplate = \
    '<i>Feature value</i>: %{customdata:.0f}<br>'+
    '<i>SHAP value</i>: %{x:.4f}<br>'
)

fig.add_shape(
    type = "line",
    x0 = 0, y0 = -1, x1 = 0, y1 = len(y_tick_labels),
    line=dict(
        color = axis_color, 
        width = 1
    ),
    opacity = 0.5
)

fig.show()

Below is the Matplotlib implementation of beeswarm plot extracted from Lundberg's code.

Run the code below to directly compare the re-implementaton with the API instead.

shap.summary_plot(ls_shap_values, ls_X_test_truth_av)

In [18]:
shap_api_beeswarm_plot(features = ls_X_test_truth_av, shap_values = ls_shap_values[1])

The Plotly figure is exported to a HTML document for preview.

In [19]:
fig.write_html(
    file = f'{OUT_FOLDER_PATH}/beeswarm_plot_full.html', 
    include_plotlyjs=True, 
    default_width='100%', 
    default_height='100%', 
    validate=True, 
    full_html=True, 
    auto_open=AUTO_OPEN, 
)

The Plotly figure is exported to a div element. A script tag that contains the JavaScript code to construct the Plotly figure in that div element is also exported as well.

In [20]:
fig_html = fig.to_html(
    include_plotlyjs='cdn', 
    default_width='100%', 
    default_height='100%', 
    validate=True, 
    full_html=False, 
)

print(f'The size of the HTML code is {round(len(fig_html.encode("utf-8"))/1000, 1)}KB.')
if PRINT_HTML:
    print(fig_html)
The size of the HTML code is 301.5KB.

Feature importance bar plot

Using explainer with tree_path_dependent approach, the SHAP values of each sample are calculated and summed up by feature. Bar plot can be used to visualize the feature importance and correlation by calculating the sum of the absolute SHAP values for each feature.

The x-axis shows the SHAP values and the y-axis shows the X features. The feature with the highest summed absolute SHAP values is displayed at the top, followed by the next highest summed absolute SHAP values in a top-down fashion. The red bar represents positive correlation between the feature and the target while the blue bar represents the negative correlation between the feature and the target.

Below is some examples on how to interpret the beeswarm plot:

  1. Total Time Spend On Site is the most important feature in determining the price. It is because Total Time Spend On Site has the longest bar length.
  2. Features like Total Time Spend On Site and Occupation_Working Professional are positively correlated with lead score. It is because the bars are coloured in red.
  3. Features like Occupation_Unemployed, Don't Email_Yes, and Occupation_Student are negatively correlated with lead score. It is because the bars are coloured in blue.

Note that the categorical feature is formatted such that the categorical feature and its categorical value separated with an underscore.

In [21]:
fi_plot_data = get_feature_importance_plot_data(ls_X_test_truth_av, ls_shap_values[1])

y_tick_labels = prettify_feature_labels(fi_plot_data['Variable'], feature_label_names)

fig = px.bar(fi_plot_data, 
             x='SHAP_abs', 
             y='Variable', 
             orientation='h', 
             labels=dict(SHAP_abs = 'Sum of absolute SHAP values (red: positive correlation)', 
                         Variable=""))

fig.update_layout(
    width = 800,
    height = 600,
    showlegend = False,
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
    xaxis = dict(
        ticks = 'outside',
        # Set the tick color and the tick label color to lighter dark colour
        tickcolor = axis_color,
        tickfont_color = axis_color,
        showline = True,
        linewidth = 1, 
        linecolor = axis_color, 
        range = [-np.max(fi_plot_data['SHAP_abs']) * 0.03, np.max(fi_plot_data['SHAP_abs']) * 1.1]
    ),
    yaxis = dict(
        # Set the tick color and the tick label color to lighter dark colour
        tickmode = 'array',
        tickcolor = axis_color,
        tickfont_color = axis_color,
        tickvals = list(range(len(fi_plot_data['Variable']))),
        ticktext = prettify_feature_labels(fi_plot_data['Variable'], feature_label_names),
        range = [-1, fi_plot_data.shape[0]]
    )
)

fig.add_shape(
    type = "line",
    x0 = 0, y0 = -1, x1 = 0, y1 = fi_plot_data.shape[0],
    line=dict(
        color = axis_color, 
        width = 1
    ),
    opacity = 0.5
)

fig.update_traces(hovertemplate = '<i>Feature:</i> %{y}<br><i>Sum of SHAP absolute values</i>: %{x:.4f}<br>')

fig.update_traces(
    marker_color = fi_plot_data['Sign'], 
)

fig.show()

Below is the Matplotlib implementation of bar plot extracted from Lundberg's code. The author adds an additional plot logic such that the red bar represents positive correlation between the feature and the target. The opposite case is true for the blue bar. The plot is inspired from Dr. Dataman's article, which can be found at https://towardsdatascience.com/explain-your-model-with-the-shap-values-bc36aac4de3d

Run the code below to directly compare the re-implementaton with the API instead.

shap.summary_plot(ls_shap_values, ls_X_test_truth_av, plot_type="bar")

In [22]:
plot_dataman_bar_plot(ls_X_test_truth_av, ls_shap_values[1])

The Plotly figure is exported to a HTML document for preview.

In [23]:
fig.write_html(
    file = f'{OUT_FOLDER_PATH}/fi_bar_plot_full.html', 
    include_plotlyjs=True, 
    default_width='100%', 
    default_height='100%', 
    validate=True, 
    full_html=True, 
    auto_open=AUTO_OPEN, 
)

The Plotly figure is exported to a div element. A script tag that contains the JavaScript code to construct the Plotly figure in that div element is also exported as well.

In [24]:
fig_html = fig.to_html(
    include_plotlyjs='cdn', 
    default_width='100%', 
    default_height='100%', 
    validate=True, 
    full_html=False, 
)

print(f'The size of the HTML code is {round(len(fig_html.encode("utf-8"))/1000, 1)}KB.')
if PRINT_HTML:
    print(fig_html)
The size of the HTML code is 9.5KB.

Model loss bar plot

Using explainer with interventional approach, the SHAP loss value of each sample are calculated. In every sample, the features with positive SHAP loss values increase the prediction error while features with negative SHAP loss values decrease the prediction error. According to Chuangxin Lin, a bar plot can be constructed to visualize the summed positive/negative SHAP loss value across all features. A positive SHAP loss bar plot can provide information on which features contribute the most model error, while a negative SHAP loss bar plot can provide information on which features help the most in reducing the model error.

To achieve better model performance, the data scientists should figure out ways to reduce both the positive SHAP loss values and negative SHAP loss values for as many samples as possible.

Note that the categorical feature is formatted such that the categorical feature and its categorical value separated with an underscore.

Positive model loss

Below is one of the ways to interpret the positive SHAP loss bar plot:

  1. Total Time Spend On Site contributes the most errors in the lead score predictions. Based on the beeswarm plot and the feature importance bar plot, Total Time Spend On Site is also contributing the most values in the predicted lead scores. This might suggest that the model relies too heavily on this feature to predict the lead score. Further model inspection is required by the data scientists to investigate the issue. The investigation is not covered since it is out of scope.
In [25]:
plotly_plot_data = get_shap_loss_plot_data(ls_X_test_truth_av, 
                                           ls_shap_loss_values[1], 
                                           shap_loss_type = 'positive', 
                                           plot = False)

fig = px.bar(plotly_plot_data, 
             x='feature', 
             y='shap_value', 
             labels=dict(shap_value = 'Sum of Positive SHAP loss value', feature=""))

fig.update_layout(
    width = 800,
    height = 700,
    showlegend = False,
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
    xaxis = dict(
        ticks = 'outside',
        tickangle = -45,
        # Set the tick color and the tick label color to lighter dark colour
        tickcolor = axis_color,
        tickfont_color = axis_color,
        tickmode = 'array',
        tickvals = list(range(len(plotly_plot_data['feature']))),
        ticktext = prettify_feature_labels(plotly_plot_data['feature'], 
                                           feature_label_names),
        range = [-1, plotly_plot_data.shape[0]]
    ),
    yaxis = dict(
        # Set the tick color and the tick label color to lighter dark colour
        tickcolor = axis_color,
        tickfont_color = axis_color,
        showline = True,
        linewidth = 1, 
        linecolor = axis_color
    )
)

fig.update_traces(hovertemplate = '<i>Feature:</i> %{x}<br><i>Sum of SHAP loss values</i>: %{y:.4f}<br>')

fig.show()

Set the plot to True to display the bar plot using Pandas API. The source code of the plot is originated from Chuangxin Lin, which can be found at https://towardsdatascience.com/use-shap-loss-values-to-debug-monitor-your-model-83f7808af40f

In [26]:
_ = get_shap_loss_plot_data(ls_X_test_truth_av, 
                            ls_shap_loss_values[1], 
                            shap_loss_type = 'positive', 
                            plot = True)

The Plotly figure is exported to a HTML document for preview.

In [27]:
fig.write_html(
    file = f'{OUT_FOLDER_PATH}/pos_model_loss_plot_full.html', 
    include_plotlyjs=True, 
    default_width='100%', 
    default_height='100%', 
    validate=True, 
    full_html=True, 
    auto_open=AUTO_OPEN, 
)

The Plotly figure is exported to a div element. A script tag that contains the JavaScript code to construct the Plotly figure in that div element is also exported as well.

In [28]:
fig_html = fig.to_html(
    include_plotlyjs='cdn', 
    default_width='100%', 
    default_height='100%', 
    validate=True, 
    full_html=False, 
)

print(f'The size of the HTML code is {round(len(fig_html.encode("utf-8"))/1000, 1)}KB.')
if PRINT_HTML:
    print(fig_html)
The size of the HTML code is 9.3KB.

Negative model loss

Below is one of the ways to interpret the negative SHAP loss bar plot:

  1. Total Time Spend On Site helps the most in reducing the model error. Combining the insights from the positive SHAP loss bar plot, Total Time Spend On Site is relevant in predicting lead scores for some samples but not the others. Further model inspection is required by the data scientists to investigate why Total Time Spend On Site is irrelevant in predicting those samples. The investigation is not covered since it is out of scope.

Note that the result is not contradictory. It is because the positive SHAP loss bar plot only sums up positive SHAP values for each feature in each sample, while negative SHAP loss bar plot only sums up negative SHAP values.

In [29]:
plotly_plot_data = get_shap_loss_plot_data(ls_X_test_truth_av, 
                                           ls_shap_loss_values[1], 
                                           shap_loss_type = 'negative', 
                                           plot = False)

fig = px.bar(plotly_plot_data, 
             x='feature', 
             y='shap_value', 
             labels=dict(shap_value = 'Sum of Negative SHAP loss value', feature=""))

fig.update_layout(
    width = 800,
    height = 700,
    showlegend = False,
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
    xaxis = dict(
        ticks = 'outside',
        side = 'top',
        tickangle = -45,
        # Set the tick color and the tick label color to lighter dark colour
        tickcolor = axis_color,
        tickfont_color = axis_color,
        tickmode = 'array',
        tickvals = list(range(len(plotly_plot_data['feature']))),
        ticktext = prettify_feature_labels(plotly_plot_data['feature'], 
                                           feature_label_names),
        range = [-1, plotly_plot_data.shape[0]]
    ),
    yaxis = dict(
        # Set the tick color and the tick label color to lighter dark colour
        tickcolor = axis_color,
        tickfont_color = axis_color,
        showline = True,
        linewidth = 1, 
        linecolor = axis_color
    )
)

fig.update_traces(hovertemplate = '<i>Feature:</i> %{x}<br><i>Sum of SHAP loss value</i>: %{y:.4f}<br>')

fig.show()

Set the plot to True to display the bar plot using Pandas API. The source code of the plot is originated from Chuangxin Lin, which can be found at https://towardsdatascience.com/use-shap-loss-values-to-debug-monitor-your-model-83f7808af40f

In [30]:
_ = get_shap_loss_plot_data(ls_X_test_truth_av, 
                            ls_shap_loss_values[1], 
                            shap_loss_type = 'negative', 
                            plot = True)

The Plotly figure is exported to a HTML document for preview.

In [31]:
fig.write_html(
    file = f'{OUT_FOLDER_PATH}/neg_model_loss_plot_full.html', 
    include_plotlyjs=True, 
    default_width='100%', 
    default_height='100%', 
    validate=True, 
    full_html=True, 
    auto_open=AUTO_OPEN, 
)

The Plotly figure is exported to a div element. A script tag that contains the JavaScript code to construct the Plotly figure in that div element is also exported as well.

In [32]:
fig_html = fig.to_html(
    include_plotlyjs='cdn', 
    default_width='100%', 
    default_height='100%', 
    validate=True, 
    full_html=False, 
)

print(f'The size of the HTML code is {round(len(fig_html.encode("utf-8"))/1000, 1)}KB.')
if PRINT_HTML:
    print(fig_html)
The size of the HTML code is 9.3KB.

Model monitoring plot

In [33]:
help(shap.monitoring_plot)
Help on function monitoring in module shap.plots._monitoring:

monitoring(ind, shap_values, features, feature_names=None, show=True)
    Create a SHAP monitoring plot.
    
    (Note this function is preliminary and subject to change!!)
    A SHAP monitoring plot is meant to display the behavior of a model
    over time. Often the shap_values given to this plot explain the loss
    of a model, so changes in a feature's impact on the model's loss over
    time can help in monitoring the model's performance.
    
    Parameters
    ----------
    ind : int
        Index of the feature to plot.
    
    shap_values : numpy.array
        Matrix of SHAP values (# samples x # features)
    
    features : numpy.array or pandas.DataFrame
        Matrix of feature values (# samples x # features)
    
    feature_names : list
        Names of the features (length # features)

In FYP2_Car_Price_Explainer.ipynb, the notebook discusses the disadvantage of the current model monitoring function and the enhancement of the model monitoring function. The notebook also conducts the validation of the both model monitoring functions under drifted & errored test set.

Setup 1: Build a tree explainer that uses interventional approach for train set.

Randomly choose 100 samples from the train set to provide as a background dataset for the explainer.

In [34]:
rng = np.random.default_rng(RANDOM_SEED)
idx_arr = rng.choice(range(len(ls_X_train)), num_subsamples)

# Randomly choose 100 samples from the train set
ls_X_train_subsample = ls_X_train.iloc[idx_arr, :].copy()

tree_loss_explainer_train = shap.TreeExplainer(
    model = ls_arf_dict, 
    feature_perturbation = 'interventional', 
    model_output = 'log_loss',
    data = ls_X_train_subsample)

"""
# If the additivity check failed in TreeExplainer, try reducing the sample used for computing SHAP values,
# as suggested at https://github.com/slundberg/shap/issues/941#issuecomment-879911858

train_sample_limit = 3000
ls_train_subsample = pd.read_csv('outputs/lead_scoring/ls_train_subsample.csv')
ls_X_train_subsample = ls_train_subsample.drop(columns='converted', axis=0)
ls_y_train_subsample = ls_train_subsample['converted'].copy()
"""

ls_shap_loss_values_train = tree_loss_explainer_train.shap_values(ls_X_train, ls_y_train)

Setup 2: Build a tree explainer that uses interventional approach for test set.

The tree explainer that uses interventional approach has already been initialized for test set in the "Tree Explainer Initialization" section.

The monitoring plots on SHAP loss values of Total Site Visit, Average Page View Per Visit, and Occupation_Working Professional are constructed using Lundberg's function as shown below. The presence of a vertical line that separates the two partitions, like the one shown in the monitoring plot for Total Site Visit, indicates that there is statistically significant mean difference between those partitions. All the plots show that there is a noticeable SHAP loss values gap between the validation set and the normal test set. It can be inferred that the model is not able to generalize very well to the test data due to overfitting issues.

In [35]:
modified_features = ['total_site_visit', 'avg_page_view_per_visit', 'occupation_Working Professional']

for feature_name in modified_features:
    feature_idx = ls_X_train_subsample.columns.get_loc(feature_name)
    shap.monitoring_plot(
        ind=feature_idx, 
        shap_values=np.vstack((ls_shap_loss_values_train[1], 
                               ls_shap_loss_values[1])), 
        features= pd.concat((ls_X_train, ls_X_test_truth_av)), 
        feature_names=ls_X_train.columns
    )

The monitoring plots on SHAP loss values of Total Site Visit, Average Page View Per Visit, and Occupation_Working Professional are constructed using Plotly as shown below. As compared to Lundberg's function, the proposed function does not raise alarm on Total Site Visit. This is expected since the proposed function starts the evaluation from validation set and the starting partition is the validation set. On the other hand, Lundberg's function starts the partitioning from the beginning of the samples and the starting partition is not customizable. In addition, unlike Lundberg's function, the proposed function raises alarm on Average Page View Per Visit, which is triggered by Welch's t-test. It can be inferred that the likelihood of alarms triggered is higher in proposed function as compared to Lundberg's function. This can lead to higher true positive alarm triggers and lower false negatives, but at the cost of higher false positive alarm triggers.

In [36]:
shap_monitoring_figs = []

for feature_name in modified_features:
    fig, alarm_info = plot_shap_monitoring_plot_plotly(
        feature_name = feature_name,
        shap_values_list = [
            ls_shap_loss_values_train[1], 
            ls_shap_loss_values[1]
        ],
        features_list = [
            ls_X_train, 
            ls_X_test_truth_av
        ], 
        feature_names = ls_X_train.columns.tolist(), 
        feature_label_names = feature_label_names
    )
    fig.update_traces(
        hovertemplate = \
        f'<i>{feature_label_names[feature_name]}</i>: '+
        '%{customdata:.0f}<br>'+
        '<i>Sample index</i>: %{x:.0f}<br>' +
        '<i>SHAP loss value</i>: %{y:.4f}<br>'
    )
    shap_monitoring_figs.append(fig)
    if 'message' in alarm_info:
        print(alarm_info['message'])
    else:
        print('No alarm sounded.')
    fig.show()
No alarm sounded.
At 5419, Welch's t-test detects feature values' mean difference with a p-value of 0.02506
No alarm sounded.

The Plotly figure is exported to a HTML document for preview.

In [37]:
shap_monitoring_figs[0].write_html(
    file = f'{OUT_FOLDER_PATH}/shap_loss_monitoring_plot_full.html', 
    include_plotlyjs=True, 
    default_width='100%', 
    default_height='100%', 
    validate=True, 
    full_html=True, 
    auto_open=AUTO_OPEN, 
)

The Plotly figure is exported to a div element. A script tag that contains the JavaScript code to construct the Plotly figure in that div element is also exported as well.

In [38]:
total_bytes = 0

for fig in shap_monitoring_figs:
    fig_html = fig.to_html(
        include_plotlyjs='cdn', 
        default_width='100%', 
        default_height='100%', 
        validate=True, 
        full_html=False, 
    )
    total_bytes += len(fig_html.encode("utf-8"))

print(f'The average size of the HTML code is {round(total_bytes/len(shap_monitoring_figs)/1000, 1)}KB.')
if PRINT_HTML:
    print(fig_html)
The average size of the HTML code is 234.7KB.

Implementation

It takes around 20 seconds to run the SHAP loss monitoring function if the full train set (5219 samples) is used as the validation set.

To reduce the computation time:

  1. Since the application data is around 1305 samples, only 1400 samples out of 5219 samples in the train set are used as the validation set.
  2. Note that the result might not be accurate since the variance increases as less samples are used in the validation set.
In [39]:
# Only 1400 samples in the train set are used as the validation set
MAX_TRAIN_SAMPLES = 1400

tree_loss_explainer_train = shap.TreeExplainer(
    model = ls_arf_dict, 
    feature_perturbation = 'interventional', 
    model_output = 'log_loss',
    data = ls_X_train_subsample)

ls_shap_loss_values_train_sub = tree_loss_explainer_train.shap_values(
    ls_X_train[:MAX_TRAIN_SAMPLES], 
    ls_y_train[:MAX_TRAIN_SAMPLES]
)

The result below shows that two alarms are triggered. As mentioned in FYP2_Car_Price_Explainer,ipynb, one way to reduce false positive alarms is by using a more stricter p-value to reduce false positives.

In [40]:
start = time.time()

shap_monitoring_fig_htmls = []
alarms = []
problematic_cols = []
alarm_types = []

for idx, feature_name in enumerate(ls_X_train.columns.tolist()):
    # Call the SHAP monitoring function
    _, alarm_info = get_shap_monitoring_plot_data(
        feature_name = feature_name,
        shap_values_list = [
            ls_shap_loss_values_train_sub[1],
            ls_shap_loss_values[1]
        ],
        features_list = [
            ls_X_train[:MAX_TRAIN_SAMPLES], 
            ls_X_test_truth_av
        ], 
        feature_names = ls_X_train.columns.tolist(), 
        increment = 50
    )
    alarms.append(alarm_info)
    if 'p-value' in alarm_info:
        # If the alarm is triggered
        print(f'\nDrift/data error detected at top {idx+1} feature, {feature_name}:')
        print(f'\t{alarm_info["message"]}')
        # Save the alarm information
        problematic_cols.append(feature_name)
        alarm_types.append(alarm_info['type'])

end = time.time()
print(f'\nExecution time: {end - start}seconds')
Drift/data error detected at top 3 feature, avg_page_view_per_visit:
	At 1600, Welch's t-test detects feature values' mean difference with a p-value of 0.03455

Drift/data error detected at top 7 feature, occupation_Unemployed:
	At 2050, Welch's t-test detects feature values' mean difference with a p-value of 7.249e-73

Execution time: 1.4453849792480469seconds

Local Explanation

Bar plot

There are various plot types like waterfall plot, force plot, bar plot, decision plot and more that is useful to provide explanation at a sample level. In comparions with other plot types, bar plot is chosen since the plot can present the local explanation very clearly to the audience when the number of features displayed is high.

As shown below, the bar plot can visualize both SHAP values and SHAP loss values. In web application, SHAP values are visualized when the users request for explanation on the lead scoring prediction. SHAP loss values are visualized when the data scientists want to know which features contribute to the prediction error and which features help reducing the prediction error.

In the bar plots, the feature with the highest absolute SHAP values is displayed at the top, followed by the next highest absolute SHAP values in a top-down fashion. In both plots, the red bar represents positive SHAP values while the blue bar represents negative SHAP values. However, the red and blue bar presents different meaning in both plots, respectively. For bar plot visualizing SHAP values, the red bar means the positive contribution of the feature to the difference between the current predicted price and expected price, while the blue bar means the negative contribution. For bar plot visualizing SHAP loss values, the red bar means that the feature contribute to the prediction error while the blue bar means that the feature help reducing the prediction error.

The data scientists can derive more useful insights when comparing both bar plots side by side. The examples are shown in the following cells.

This example shows an partially accurate prediction. For this sample, avg_page_view_per_visit and total_site_visit are the top two features that contribute the most in the lead scoring prediction. Four out of eight features contributes to the errors in the lead scoring prediction. Nevertheless, the prediction is still accurate since the avg_page_view_per_visit and total_site_visit significantly reduce the prediction error to compensate the positive SHAP loss value contributions from other features. In other words, the prediction is accurate since most of the features negatively contribute to the prediction error, as indicated by the higher number of longer blue bars as compared to red ones.

In [41]:
print('\nLocal explanation for lead scoring prediction:')
shap.bar_plot(ls_shap_values[1][405], feature_names=ls_data_pp.features, max_display=10)
print('\nLocal explanation for lead scoring prediction error:')
shap.bar_plot(ls_shap_loss_values[1][405], feature_names=ls_data_pp.features, max_display=10)
Local explanation for lead scoring prediction:
Local explanation for lead scoring prediction error:

This example shows an inaccurate prediction. For this sample, total_time_spend_on_site and occupation_Unemployed are the top two features that contribute the most in the lead scoring prediction. Out of the top eight features, the avg_page_view_per_visit and received_free_copy_Yes are the only features that are not contributing to the errors in the lead scoring prediction. The prediction is inaccurate since most of the features positively contribute to the prediction error, as indicated by the higher number of longer red bars as compared to blue ones. Since the current sample is the test sample, it is possible that the model is not sufficiently trained with samples that represent this sample. It is also possible that the sample is a drifted sample, errored sample, or a malicious sample. Further investigation is required but the investigation is not covered since it is out of scope.

In [42]:
print('\nLocal explanation for lead scoring prediction:')
shap.bar_plot(ls_shap_values[1][381], feature_names=ls_data_pp.features, max_display=10)
print('\nLocal explanation for lead scoring prediction error:')
shap.bar_plot(ls_shap_loss_values[1][381], feature_names=ls_data_pp.features, max_display=10)
Local explanation for lead scoring prediction:
Local explanation for lead scoring prediction error:

Below are the bar plots constructed using Plotly.

In [43]:
indexes = [3, 381]

local_shap_bar_figs = []
local_shap_loss_bar_figs = []

for index in indexes:
    print(f'Local explanation for lead scoring prediction at {index+1}th sample:')
    fig = plot_local_bar_plot_plotly(
        shap_values = ls_shap_values[1][index],
        features = ls_X_test_truth_av.values[index, :],
        feature_names = ls_X_test_truth_av.columns.tolist(), 
        feature_label_names = feature_label_names
    )
    fig.update_traces(
        hovertemplate = '<i>Feature:</i> %{y}<br>'+
        '<i>Feature value</i>: %{customdata:.0f}<br>'+
        '<i>SHAP values</i>: %{x:.4f}<br>',
    )
    fig.show()
    
    print(f'Local explanation for lead scoring prediction error at {index+1}th sample:')
    loss_fig = plot_local_bar_plot_plotly(
        shap_values = ls_shap_loss_values[1][index],
        features = ls_X_test_truth_av.values[index, :],
        feature_names = ls_X_test_truth_av.columns.tolist(), 
        feature_label_names = feature_label_names, 
        shap_type = 'SHAP loss value'
    )
    loss_fig.update_traces(
        hovertemplate = '<i>Feature:</i> %{y}<br>'+
        '<i>Feature value</i>: %{customdata:.0f}<br>'+
        '<i>SHAP loss values</i>: %{x:.4f}<br>',
    )
    loss_fig.show()
    
    local_shap_bar_figs.append(fig)
    local_shap_loss_bar_figs.append(loss_fig)
Local explanation for lead scoring prediction at 4th sample:
Local explanation for lead scoring prediction error at 4th sample:
Local explanation for lead scoring prediction at 382th sample:
Local explanation for lead scoring prediction error at 382th sample:

The Plotly figure is exported to a HTML document for preview.

In [44]:
local_shap_bar_figs[0].write_html(
    file = f'{OUT_FOLDER_PATH}/local_shap_bar_plot_full.html', 
    include_plotlyjs=True, 
    default_width='100%', 
    default_height='100%', 
    validate=True, 
    full_html=True, 
    auto_open=AUTO_OPEN, 
)

local_shap_loss_bar_figs[0].write_html(
    file = f'{OUT_FOLDER_PATH}/local_shap_loss_bar_plot_full.html', 
    include_plotlyjs=True, 
    default_width='100%', 
    default_height='100%', 
    validate=True, 
    full_html=True, 
    auto_open=AUTO_OPEN, 
)

The Plotly figure is exported to a div element. A script tag that contains the JavaScript code to construct the Plotly figure in that div element is also exported as well.

In [45]:
total_bytes = 0

for fig in local_shap_bar_figs:
    fig_html = fig.to_html(
        include_plotlyjs='cdn', 
        default_width='100%', 
        default_height='100%', 
        validate=True, 
        full_html=False, 
    )
    total_bytes += len(fig_html.encode("utf-8"))

print(f'The average size of the HTML code is {round(total_bytes/len(shap_monitoring_figs)/1000, 1)}KB.')
if PRINT_HTML:
    print(fig_html)
The average size of the HTML code is 6.4KB.
In [46]:
total_bytes = 0

for fig in local_shap_loss_bar_figs:
    fig_html = fig.to_html(
        include_plotlyjs='cdn', 
        default_width='100%', 
        default_height='100%', 
        validate=True, 
        full_html=False, 
    )
    total_bytes += len(fig_html.encode("utf-8"))

print(f'The average size of the HTML code is {round(total_bytes/len(shap_monitoring_figs)/1000, 1)}KB.')
if PRINT_HTML:
    print(fig_html)
The average size of the HTML code is 6.4KB.

Thank you for reading.